Relevant search rankings using high refresh-rate distributed crawling

ABSTRACT

A system for maximal gathering of fresh information added to a network such as the Internet and for processing the gathered fresh information. A link server (2) sends a batch of links to check (3) to a crawler (1B). Crawler (1B) then executes its crawling assignment by filtering the encountered content and extracting only that which is new or changed (4). Crawler (1B) then returns this content (4) to at least one data center and any interested web mining application (5). By using the crawlers (1A-E) to filter the data and only return, or notify regarding, the fresh content, less bandwidth is needed to get the information to the web mining application (5).

STATEMENT REGARDING GOVERNMENT FUNDING

[0001] This invention was made under the DARPA Metacomputing project titled: “End to End Resource Allocation in Metacomputers”, DARPA/ITO, Contract number G438-E46-2074. The Government may have certain rights in this invention.

FIELD OF THE INVENTION

[0002] The invention generally relates to a computerized network such as the Internet or World Wide Web (“WWW”), and more particularly, to maximizing retrieval of content freshly published on the network.

BACKGROUND OF THE INVENTION

[0003] The Internet has become an important computerized network, which can be accessed by computer users worldwide. Over the years, the number of Internet sites and the available content on the Internet has grown. Only in some cases does an Internet user already know the specific website address (e.g., www.______.com) to enter to access a desired website. Often Internet users want to retrieve certain subject matter, without being able to provide the Internet address(es) (i.e., the domain) at which such subject matter may be located. Such users want to be able to have a search of the Internet performed for them to retrieve their desired subject matter, and conventionally they have done so via so-called “search engines”, such as www.google.com. For such searching, a user generally starts by visiting the search engine site (e.g., www.google.com), where the user encounters a field in which to enter his desired word(s) or phrase to search.

[0004] An Internet user querying a popular commercial search engine such as www.google.com or www.altavista.com will get back a listing of content that the search engine deems relevant to the query. While some search engines perform this task better than others, all known search engines before the present invention suffer from a common problem: a majority of the content they return is old. For example, with conventional search engines sometimes the most current content returned has been between one and three months old. This datedness is a real problem for some types of content, such as current events and news stories.

[0005] Current search engines work from an index to locate Web documents that satisfy specified search criteria. The index is a limiting factor for the search engine, i.e., if the index only has “old” content, the search engine can do no better in returning information to the user query.

[0006] Preparation and maintenance of such an index conventionally has been accomplished by a “web crawler”, which is a computer program that automatically retrieves numerous Web documents from one or more Web sites. A Web crawler processes the received data, preparing the data to be subsequently processed by other programs, such as creation of a search engine-useable index of documents available on the Internet.

[0007] Conventional crawlers have been proposed and are in use. However, the problem of returning current content continues to go unsolved. The fault in the current search engine systems of failing to return current content arises from a combination of two problems that have yet to be addressed.

[0008] The first problem is the slow scan rate at which search engines currently look for new and changed information on a network. The best conventional crawlers visit most web pages only about once a month. Because the scan rates of these conventional search engines are so slow, there is no way for them to capture a majority of all the fresh content that is available. One reason why these crawlers scan so slowly is their dependence on a centralized crawling method where all of the crawlers crawl from a small number of sites on the network. This set-up causes a lot of the downloaded information to traverse the same network pipe. Reaching high network scan rates on the order of a day with such an approach would be impractical, because it would require an enormous amount of bandwidth flowing to a small number of locations on the network, at a cost that would not be economically feasible. Due to such economics, most search engines have a scan rate much slower than once a day.

[0009] A second problem that occurs is that current search engines do not incorporate new content into their “rankings” very well. Conventional search engines use certain methods to arrive at an Authority Measure for a page. For example, Google's PageRank ranking technology depends on the number of links a to-be-ranked page has linking to it in order to decide on the weight of the to-be-ranked page. Because new content inherently does not have many links to it, it will not be ranked very high under Google's PageRank scheme or similar schemes. Thus, for certain search engines, even if the search engine identifies a site as having relevant content, if the website has few links to it because of its newness, the search engine will rank it low in the list of retrieved site addresses that the searching user views. Some search inquiries can return thousands or more of addresses responsive to the request, so that being a low-ranked result of the search decreases the likelihood that the searcher will actually view the content.

[0010] Like Google, other conventional search engines also derive an Authority Measure for a page based on the number of links that point to the page. Thus normally an Authority Measure will be low for new pages. Newly created content, being new, is unknown to most people, and, not knowing about it, people have not put HTML (HyperText Markup Language) links in their documents pointing to it. Under conventional systems, new content maintains a low score, until more people find out about it and link to it.

[0011] Search engines fall into the general category of web-mining applications. These applications collect and extract large amounts of data from the web, for further processing. In the case of search engines, this further processing is the construction and maintenance of searchable indexes. Many other processing methods can be performed on this data. Examples of such applications include event notification systems, market analysis, corporate intelligence, etc. The field of data-mining is closely related to web-mining: in data-mining, data is usually processed from a database, whereas in web-mining, data is primarily processed from information on the web. The architecture of conventional web mining applications is shown in FIG. 5(a). Conventional web mining applications use polling methods, in which the applications must continually poll the data available on the network (such as the Internet) to determine what is there and what is changed. Conventionally, all data/web mining, including search engines, corporate intelligence, etc., have been using polling methods. Practically speaking, the conventional methods provide for visiting pages in set lists now-and-then, and seeing what is in the pages. The amount of work to be done when using such polling methods is extraordinarily large.

[0012] Somewhat separate from the development of the above-mentioned Internet searching technology, so-called “metacomputer” technology has been developing. The idea of a metacomputer was first popularized by the Seti@home project in 1996, relating to searches for Extra Terrestrials by scanning the sky for intelligent radio signals originating outside the solar system. Metacomputers then developed for more generalized uses.

[0013] A metacomputer system manages and contains a large number of machines (managing servers, and the contributor nodes). Together, the system created is a powerful virtual computer. In a metacomputer, like any computer, there is an operating system, and the applications that run on top of the operating system. In the original use by the Seti@home project, the application and operating system (“OS”) were combined, and only the seti application could run on their system. Starting in about the first half of 2000, many companies took up this idea, creating such virtual computers on which people could run their distributed applications.

[0014] Such metacomputers require operating systems, and the Share System developed at Johns Hopkins by Jacob Green and John Schultz was an early development of such a virtual computer. Green et al. have published information about the Share System, e.g., at www.cnds.jhu.edu. A metacomputer such as Share has a two-component basic architecture consisting of the Contributor Environment (CE) which runs on contributors' machines, and the Allocation Servers (AS) that hand out jobs to the CEs. Another such metacomputer was constructed by what was formerly known as PopularPower before March 2001 when it went out of business under that name. Another metacomputer is that of DistributedScience.net (created when ProcessTree merged with Dcypher.net). Other operating systems include Entropia.com, AppliedMeta.com, UnitedDevices and DataSynapse.

[0015] Eichstaedt et al. in U.S. Pat. No. 6,182,085 issued Jan. 30, 2001 for “Collaborative Team Crawling: Large Scale Information Gathering Over the Internet”, recognize the need to make a crawler (gatherer) more efficient, and provide a method using multiple processors for collaborative web crawling and information processing. They use a set of crawlers running at the same location. However, a need still remains for systems for maximally retrieving, indexing, rating and making available current content on a network.

[0016] U.S. Pat. No. 6,151,624 to Teare et al., issued Nov. 21, 2000, entitled “Navigating Network Resources Based on Metadata”, provides for the crawler to execute every 24 hours. (Column 17.) The crawler polls Web sites on the Internet to locate customer sites that have updates, and a database is updated. (Column 18.) Although the crawler is commanded to execute every 24 hours, index files are only updated weekly based on the database. (Columns 17-18.)

[0017] Thus, improved technology is needed for successfully gathering fresh content from a network such as the Internet, especially technology that can operate without getting bogged down by the vast amount of unchanged content on the Internet. Also, there remains a need for technology to effectively rank new content.

SUMMARY OF THE INVENTION

[0018] It therefore is an object of this invention to provide a method of expediting retrieval of new content published on the Internet or another network.

[0019] It is a further object of the invention to provide a method of using a distributed crawling system for efficiently and promptly gathering new content published on the Internet or another network.

[0020] Additionally, it is an object of the present invention to shorten the time in which new published content on the Internet or another network may be accessed by a user posing a content-based query, or by a web mining application.

[0021] Additionally, it is an object of this invention to provide a model for creating and maintaining web mining applications (like search engines) based on update notifications, and not needing to rely on continual polling of the data.

[0022] Additionally, it is an object of this invention to provide a platform on which many web mining applications can listen to a commonly available set of update notifications to build their application and stay current.

[0023] In order to accomplish these and other objects of the invention, the present invention in a preferred embodiment provides systems for processing fresh information added to a network, such as a system comprising, for a network, identifying fresh information added to the network; and presenting the fresh information as a stream of events. The invention may be used wherein the network is the Internet or an intranet.

[0024] In a particularly preferred embodiment, the invention provides for the stream of events to be made available for concurrent use by a plurality of web-mining applications.

[0025] The invention further provides for the system optionally to include rating the fresh information.

[0026] In another embodiment of the inventive system, the fresh information identification may be by a metacomputer deployed to identify fresh information.

[0027] In another embodiment, the invention provides a method of gathering information freshly available on a network, such as a method comprising deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.

[0028] The invention in another embodiment provides that the information-gathering method may include deploying a distributed system of crawlers. A further embodiment of the invention provides for commanding the crawlers to encounter content on the network and to filter encountered content for freshness. Another embodiment of the invention provides for the filter of encountered content for freshness to comprise instructions to filter old or unchanged information and to gather only information on the network that is new or changed. In the invention, in a particularly preferred embodiment of using the crawlers, the crawlers sit on a plurality of machines across the network.

[0029] In another embodiment, the invention provides for using a metacomputer that includes at least one link server for receiving content from the crawlers. In a particularly preferred embodiment of the invention, crawlers are commanded to return only the fresh encountered content to the link server. The invention in a further embodiment provides that data is compressed before being sent by a crawler.

[0030] In another preferred embodiment, the invention also provides a high scan rate, decreased bandwidth method for data delivery, such as a method comprising: (A) providing at least one coordinating Link Server to direct a plurality of crawlers through low bandwidth commands, (B) providing that when a crawler is instructed by the Link Server to check a page link, for the to-be-checked page link the crawler also is told information including URL name, last time checked, and a last crawl date page digest from when the link was last checked; (C) connecting a crawler to the to-be-checked page and commanding the crawler to read a header of the to-be-checked page, and (1) commanding the crawler that if the to-be-checked page header returns a last modified date, the crawler check the page against the last crawl date associated with the to-be-checked page; further provided that: (i) for a to-be-checked page found to be unchanged, the crawler bypasses and does not download/process the to-be-checked page; but (ii) if the to-be-checked page is found to have changed since the last checked time, the crawler notifies the Data Center that the to-be-checked page has been changed, downloads, processes, compresses and sends the to-be-checked page content to the Data Center, (2) commanding the crawler that if no last modification date is found in the to-be-checked page header, the crawler downloads the page, and then runs the downloaded page through a function at the crawler to obtain a new page digest for matching against a last crawl page digest, if any, provided that: (i) if and only if the new page digest can be matched to a last crawl page digest, the crawler proceeds to the next link to be checked; but (ii) if for the new page digest no matching last crawl page digest is found, the crawler then notifies the Data Center and/or transmits the new page digest to the Data Center, further provided that the crawler returns the links originally received from the Link Server with updated digests and crawl times.
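
The decision logic of steps (C)(1) and (C)(2) can be sketched as follows. This is a minimal illustration under stated assumptions and not the claimed method itself: the names `LinkRecord`, `page_digest` and `check_page` are hypothetical, SHA-256 is assumed as the digest function (the text only calls for “a function”), and the page download is passed in as a callable so that the sketch does not depend on any particular HTTP library.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class LinkRecord:
    """Hypothetical record a Link Server hands to a crawler for one link."""
    url: str
    last_checked: Optional[float]  # time of last crawl, epoch seconds
    last_digest: Optional[str]     # page digest from the last crawl, if any


def page_digest(body: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified digest function.
    return hashlib.sha256(body).hexdigest()


def check_page(
    record: LinkRecord,
    last_modified: Optional[float],
    fetch_body: Callable[[], bytes],
) -> Tuple[bool, Optional[bytes], Optional[str]]:
    """Return (changed, body, digest); body is None for unchanged pages."""
    # Case (1): the header reports a last-modified date and a crawl date exists.
    if last_modified is not None and record.last_checked is not None:
        if last_modified <= record.last_checked:
            # Unchanged: bypass the page without downloading it.
            return (False, None, record.last_digest)
        body = fetch_body()
        return (True, body, page_digest(body))

    # Case (2): no usable date, so download and compare digests.
    body = fetch_body()
    digest = page_digest(body)
    if record.last_digest is not None and digest == record.last_digest:
        return (False, None, digest)
    return (True, body, digest)
```

In this sketch the download is skipped entirely only in the date-based case; the digest-based case still pays for one download but avoids sending an unchanged page onward.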

[0031] The invention also in a further embodiment provides for a method wherein whenever the crawler downloads a page determined to be new or changed, the crawler optionally extracts the links on the downloaded page and reports the extracted links to the Link Server. Such methods according to the invention in a further embodiment may include identifying if extracted links are valid by commanding the crawlers to attempt to connect to the extracted links from a downloaded page. The invention also provides for methods including commanding the crawler, once connected, to also filter out the links and only extract and return HTML/TEXT links.

[0032] The invention in a further preferred embodiment provides methods that include information processing by the crawlers on the downloaded pages. In such inventive methods, the information processing may be stripping out HTML tags and using information retrieval and/or natural language processing techniques to characterize the document.

[0033] In another embodiment, the invention provides for methods that include updating Link Server records on the links and scheduling them for later crawling or re-crawling, such as management by the Link Server of link assignments for crawling. In a further preferred embodiment, the invention provides for management by the Link Server to comprise assigning network-wise close links to a crawler and/or arranging for relatively more frequent crawling of links from domains with track records of frequent change.

[0034] In a further preferred embodiment, the invention provides methods and systems in which the Data Center, upon receiving new or changed content, conducts at least one of the following: (a) storage of the new or changed content; (b) storage of only delta changes of a page; (c) data mining; (d) data processing; (e) application of data to at least one search engine; (f) intelligent caching.
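
As one possible reading of option (b), the sketch below stores only the lines that differ between the previous and the current version of a page. The function name `page_delta` is hypothetical, and the choice of a line-based unified diff from Python's standard `difflib` module is an assumption; the text does not prescribe any delta format.

```python
import difflib
from typing import List


def page_delta(old_text: str, new_text: str) -> List[str]:
    """Return a line-based unified diff between two versions of a page.

    An empty list means the two versions are identical.
    """
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            lineterm="",  # the input lines carry no trailing newlines
            n=0,          # keep no context lines, only the changes
        )
    )
```

Removed lines appear in the result prefixed with `-` and added lines with `+`, so a Data Center could store the delta in place of the full page.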

[0035] The invention also provides methods of processing new information on a network and of rating gathered fresh information, such as a method comprising: (A) for information encountered on the network that is new relative to a data base of existing content, identifying at least one existing document within a predetermined distance from the newly encountered information; and, (B) identifying an already-established weight of the at least one existing nearby document identified according to step (A). The inventive method in a further embodiment may include, for the newly encountered information, assigning a weight measurement partially based on the already-established weight(s) identified in step (B) of the at least one existing nearby document. A particularly preferred embodiment of the invention provides time-adjusted weighting of the new information, such as time-adjusted weighting of the new information comprising assigning a time dependent function to the assigned weight measurement, wherein as the new information ages, less weight based on the at least one existing nearby document is accorded the new information.

[0036] The invention in another preferred embodiment provides a ranking method for new or changed content on a network, such as a method comprising partially ranking the new or changed content based on at least one neighboring page. In a further embodiment, the invention provides a ranking method wherein the partial ranking of a new page X with a URL of form http://www.xyz.edu/a/b/c/d/X.html, wherein “xyz” may be any domain name, “.edu” may be any web suffix including but not limited to .com, .net and .tv and a, b, c and d are variables, comprises assigning a Temporary_Authority_Measure based on at least one Authority_Measure of at least one page in the same a/b/c/d/ directory or in a page that is a predetermined distance from the new page. In another embodiment, the inventive ranking method includes reducing the effect of any neighboring page with time. In a further embodiment, the invention provides a method wherein the ranking method includes a time-dependent reduction of the Temporary_Authority_Measure.
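
A minimal sketch of such a Temporary_Authority_Measure follows. The text requires only that the measure be based on neighboring pages and be reduced with time; the specific choices here (averaging the Authority_Measure of pages in the same directory, and an exponential decay with a 30-day half-life) are assumptions made for illustration, as are the names `directory_of` and `temporary_authority`.

```python
import posixpath
from typing import Dict
from urllib.parse import urlsplit


def directory_of(url: str) -> str:
    """Return host plus directory, e.g. 'www.xyz.edu/a/b/c/d'."""
    parts = urlsplit(url)
    return parts.netloc + posixpath.dirname(parts.path)


def temporary_authority(
    new_url: str,
    known_authority: Dict[str, float],
    age_days: float,
    half_life_days: float = 30.0,  # assumed decay rate, not from the text
) -> float:
    """Borrow authority from same-directory neighbours, fading with age."""
    target = directory_of(new_url)
    neighbours = [
        weight
        for url, weight in known_authority.items()
        if url != new_url and directory_of(url) == target
    ]
    if not neighbours:
        return 0.0
    base = sum(neighbours) / len(neighbours)
    # Halve the borrowed authority every half_life_days.
    return base * 0.5 ** (age_days / half_life_days)
```

As the page ages and acquires links of its own, a ranking system could combine this fading temporary value with the page's own Authority_Measure; how the two are combined is left open here.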

[0037] The invention in another preferred embodiment also provides computer-readable information, such as computer-readable information produced (A) from a stream of events comprising fresh information identified for a network; or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information. The invention in additional preferred embodiments provides a computer data base of such computer-readable information, and an index prepared from such a computer data base. The invention in another preferred embodiment provides an electronic library wherein the library consists essentially of such an index. In another preferred embodiment, the invention provides a computerized search engine wherein the search engine queries an index prepared from such computer-readable information.

[0038] The invention in another preferred embodiment provides a distributed system of crawlers returning content from a network to a link server, wherein each crawler: (1) minimizes time spent on old and unchanged content; (2) filters and excludes from returning old or unchanged content to the link server; and (3) gathers and returns fresh content to the link server.

[0039] Also, in another preferred embodiment, the invention provides a monitoring method for at least one web mining application, such as a method comprising screening web documents for changed content, wherein the screening occurs in a system external to the web mining application. In a particularly preferred embodiment, the inventive monitoring method includes, in the external system, locating changed content and preparing a stream of updates characterizing the changed content. In another particularly preferred embodiment, the inventive monitoring method includes providing the stream of updates to the at least one web mining application. In a further embodiment, the inventive monitoring method is one wherein the stream of updates is provided to multiple web mining applications. Another inventive embodiment provides a monitoring method wherein the stream of updates is simultaneously useable by the multiple web mining applications.

[0040] A further embodiment of the invention provides a monitoring method wherein the screening includes applying a change filter to prohibit unchanged web documents and other repetitive content from reaching the web mining application. In a particularly preferred embodiment, the invention provides a monitoring method wherein the change filter comprises a data center cooperating with a network/metacomputer system.

BRIEF DESCRIPTION OF THE DRAWINGS

[0041] The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of the preferred embodiments of the invention with reference to the drawings, in which:

[0042] FIG. 1(a) is a schematic diagram of a high scan rate architecture that is a distributed system according to the present invention.

[0043] FIG. 1(b) is a schematic diagram of interaction between components of a high scan rate architecture system according to FIG. 1(a).

[0044] FIG. 2 is a schematic diagram of changed and new content testing according to the present invention.

[0045] FIG. 3(a) is a flow chart for a crawling system according to the invention, in the context of a share allocation server and metacomputer. FIG. 3(b) is a flow chart for a crawling system according to the invention, in the context of the Internet.

[0046] FIG. 4 is a flow chart of a stream of updates created by a crawling system according to the invention, and the web-mining applications that are built by and stay current based on the invention.

[0047] FIGS. 5(a) and 5(c) are each a flow chart of a conventional polling model. FIGS. 5(b) and 5(d) are each a flow chart of an event driven model which is an example according to the present invention.

[0048] FIG. 6 is a graph of resources versus number of web mining applications, contrasting a conventional polling model with an example of an inventive model.

[0049] FIG. 7 is a flow chart of a monitoring system according to the invention.

[0050] FIGS. 8(a) and 8(b) are graphs of a temporary authority measure(TAM) according to the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS OF THE INVENTION

[0051] In a first preferred embodiment, the invention uses a distributed system of crawlers to efficiently gather new and changed published information on a network. In a most preferred embodiment, the distributed system is a metacomputer system. Examples of a suitable metacomputer include the Share Metacomputer developed by Green et al. at the Johns Hopkins University, and the systems of PopularPower.com, processtree.com, DistributedScience.net, Entropia.com, AppliedMeta.com, UnitedDevices, DataSynapse, and the like.

[0052] A metacomputer operating system suitable for use in the present invention is made up of many participating computers on the network (such as the Internet). The metacomputer consists of the Contributor Environments (CE) and the Allocation Servers (AS). The CE is a software application that is installed on a contributor's computer, which can be any modern computer with a network connection. The CE runs as a low priority background process, to minimize the impact on performance experienced by a normal user of the computer, and to basically harness only the unused resources of the contributor's computer. The task of the CE is to download, monitor and run jobs given to it. The AS coordinates the efforts of all the nodes in the metacomputer. The allocation server(s) hand out jobs to the CEs, trying to efficiently use the combined resources of all the nodes. The AS has the ability to add or remove any job running on a node, and to upgrade the CE application to the newest version.

[0053] The present invention uses a metacomputer with a suitable operating system, such as Share, combined with a distributed crawling system such as that of the HyperDog System; the distributed crawling system in U.S. Pat. No. 6,182,085 issued Jan. 30, 2001 to Eichstaedt et al. of IBM, entitled “Collaborative Team Crawling: Large Scale Information Gathering Over the Internet”; or the distributed crawling system of S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” the 7th International World Wide Web Conference (WWW7), 1998.

[0054] In a preferred embodiment of the invention, the operating system is the Share system, a virtual OS that runs the metacomputer. The Share system preferably is used with the HyperDog system, a distributed crawling system that is an example of a system that works well on a metacomputer. The HyperDog system is an example of a system that distributes the crawlers to many points in the network through a metacomputer, so that the crawling algorithm gains advantages such as bandwidth savings to the central indexing point. The crawlers filter the pages they crawl and only return the pages they perceive as new or changed since their last visit. Because only a small portion of the WWW changes each day, the changes sent back to the central indexer are greatly reduced compared to conventional crawler technology. The crawlers further save bandwidth by compressing their communications with the indexers.

[0055] Using a metacomputer with a distributed crawling system according to the invention provides at least several advantages. When scan rates are high (on the order of daily), a savings of about two orders of magnitude can be achieved. Also, using a metacomputer system provides ease of manageability and use. Individual machines in a metacomputer are self-managed and administered, and are normally replaced or upgraded on a regular schedule. To the centralized indexer, the metacomputer represents a black box that feeds the indexer the changes that are occurring on the WWW, without having to manage the crawlers or the infrastructure on which they run. The indexers experience the event driven model where event notifications are received through a stream of changes from the black box.

[0056] A system such as HyperDog gives the indexers all the updates that occur without overwhelming the indexers with redundant pages that have not changed. Because the recrawls are done in a distributed fashion, greater speed is achieved through parallelization. Initially the speed will be relatively slow because everything will be “new” or “changed” to the crawlers. In this initial phase the crawlers will be discovering the existing content on the WWW. After the initial phase, a system like HyperDog will reach a relative steady state, where only the normal growth of the WWW will be reported.

[0057] By using such a system and generating a stream of update notifications, many indexing or web-mining applications can share the costs of crawling. Unlike many of the conventional web-mining applications, which effectively reproduce each other's crawling work, the stream of updates according to the present invention allows for shared commonalities, with each application customizing the data to its own differing web-mining needs.
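
The shared stream of update notifications can be pictured as a simple publish/subscribe arrangement. The sketch below is illustrative only: the class name `UpdateStream`, the in-process callback design, and the `url`/`digest` fields of an update are assumptions, and a deployed system would deliver the stream over a network.

```python
from typing import Callable, Dict, List

# Hypothetical shape of one update notification.
Update = Dict[str, str]


class UpdateStream:
    """Fan one stream of change notifications out to many applications."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Update], None]] = []

    def subscribe(self, handler: Callable[[Update], None]) -> None:
        # Each web-mining application registers its own handler once.
        self._subscribers.append(handler)

    def publish(self, update: Update) -> None:
        # The crawl is paid for once; every subscriber sees every update
        # and keeps only what suits its own needs.
        for handler in self._subscribers:
            handler(update)


# Two hypothetical applications sharing one stream.
stream = UpdateStream()
search_index: Dict[str, str] = {}
news_alerts: List[str] = []


def index_update(update: Update) -> None:
    search_index[update["url"]] = update["digest"]


def alert_on_news(update: Update) -> None:
    if "/news/" in update["url"]:
        news_alerts.append(update["url"])


stream.subscribe(index_update)
stream.subscribe(alert_on_news)
stream.publish({"url": "http://example.com/news/today.html", "digest": "abc"})
stream.publish({"url": "http://example.com/about.html", "digest": "def"})
```

Here the indexing application records both updates while the alerting application keeps only the news page, which is the kind of per-application customization the paragraph above describes.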

[0058] The operating system, of which Share is an example, manages all the nodes of the metacomputer. The operating system leases time on the virtual computer to companies (or others) that have applications that those companies want to use. The system of the present invention is such an application for desired use (called HyperDog). The crawlers in the crawling system of the invention get run on nodes of the metacomputer. It is the metacomputer's responsibility to hand out the crawling application to its nodes, to make sure they are up and running, and to make sure that the nodes are not being harmed by the application.

[0059] This distributed system of crawlers may sit on many machines across the network, as depicted in FIG. 1(a), as in a preferred embodiment of the present invention. With reference to FIG. 1(a), crawlers 1a, 1b are shown. Crawlers 1a, 1b etc. may be, but are not necessarily required to be, the same. Crawlers 1a through 1f are shown in FIG. 1(a) by way of example, but it will be readily appreciated that the number of crawlers is not limited.

[0060] As shown in FIG. 1(a), a preferred embodiment of the invention uses at least one Link Server (“LS”) 2. One or more LSs may be added in addition to LS 2. The LS 2 is selected to obtain a high scan rate, by a decrease in bandwidth to a web-mining application, and the LS is capable of directing the crawlers through low bandwidth commands. The LS is programmed for sending a link or batch of links to a crawler selected by the LS. The LS further is programmed for including in each such link the to-be-checked URL name, and, if applicable, the last time it was checked and a page digest from when it was last checked.

[0061] A suitable LS for use in the present invention is one that is programmed to update its records on links and schedule them for later crawling. A Link Server can “hand out” links to a crawler blindly but more preferably the Link Server will be “smart” about the hand-outs. Namely, the Link Server preferably tries to hand out to each crawler links that are “network-wise” close to each other, sends out links more frequently from domains with track records of frequent change, and otherwise optimizes crawling efforts.
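
Two of these “smart” hand-out policies can be sketched as below. Both are simplifications introduced for illustration: sharing a hostname is used as a stand-in for being “network-wise” close, and the re-crawl interval is a linear interpolation between assumed bounds of one hour and thirty days. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional
from urllib.parse import urlsplit


def group_by_host(urls: Iterable[str]) -> Dict[Optional[str], List[str]]:
    """Batch links so that one crawler receives links on the same host."""
    batches: Dict[Optional[str], List[str]] = defaultdict(list)
    for url in urls:
        batches[urlsplit(url).hostname].append(url)
    return dict(batches)


def recrawl_interval_hours(
    changes_seen: int,
    checks_made: int,
    min_hours: float = 1.0,    # assumed lower bound
    max_hours: float = 720.0,  # assumed upper bound (thirty days)
) -> float:
    """Shorter interval for domains that have changed often in the past."""
    if checks_made <= 0:
        # No track record yet: check soon to start building one.
        return min_hours
    change_rate = min(1.0, changes_seen / checks_made)
    return max_hours - change_rate * (max_hours - min_hours)
```

A domain that changed on every past check is scheduled at the minimum interval, while one that never changed drifts to the maximum; a real Link Server would likely use a richer model of change frequency and of network distance.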

[0062] In FIG. 1(a), Link Server 2 is shown communicating with crawler 1b. However, it will be appreciated that Link Server 2 communicates with other crawlers, such as crawlers 1a, 1c. Additionally, a crawler is not necessarily restricted to communication with a single Link Server, but generally a crawler is in communication with one Link Server, such as crawler 1b being controlled by Link Server 2. A crawler is programmed for connecting to a to-be-checked page and reading the page header. A crawler further is programmed that if the page returns a last modified date, the crawler is to check this date against the last crawl date information associated with this link. A crawler is programmed so that if the page has changed since the last time it was checked, the crawler will either let the LS and web-mining applications know that it has changed, or download, process, and compress the page content for sending to the LS and web-mining applications.

[0063] A suitable crawler for use in the present invention is one which “filters” to-be-checked pages to minimize time spent on unchanged pages and to minimize or eliminate downloading and sending of unchanged pages, while recognizing and gathering fresh information.

[0064] One example is a crawler that minimizes time spent on an unchanged page in two ways. First, the crawler is programmed so that if it finds from the page header that the page has not changed, it proceeds immediately to the next page. Second, the crawler is programmed so that if the web server serving a to-be-checked page does not report the date the content was last modified, the crawler downloads the page and runs the downloaded page through a function at the crawler to get a digest of the downloaded page. The crawler is further programmed so that if such a new digest matches the digest from the last crawl of the page, the crawler assumes the page to be unchanged and proceeds to the next link.

[0065] One example of a crawler that recognizes fresh content is a crawler programmed so that, upon finding that a crawler-prepared digest does not match the digest provided to the crawler, the crawler assumes that the page has changed and notifies and/or sends the data to the LS. A crawler may batch up these return notifications to be sent back to the LS. Such batching-up reduces the communications overhead of reestablishing any communication channels between a crawler and a LS.
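
The batching-up of return notifications may be sketched as follows. This is an illustrative assumption only: the class name `NotificationBatcher`, the `send` callback standing in for the channel to the LS, and the batch size are hypothetical.

```python
from typing import Callable, Dict, List


class NotificationBatcher:
    """Hypothetical helper that accumulates change notifications and sends them in groups."""

    def __init__(self, send: Callable[[List[Dict]], None], max_batch: int = 50) -> None:
        self._send = send          # stands in for the channel back to the Link Server
        self._max_batch = max_batch
        self._pending: List[Dict] = []

    def notify(self, url: str, digest: int) -> None:
        # Queue one notification; send the group once it is full.
        self._pending.append({"url": url, "digest": digest})
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        # Send whatever is pending in a single transmission, if anything.
        if self._pending:
            self._send(self._pending)
            self._pending = []
```

With a batch size of N, the channel to the LS is used once per N changed pages rather than once per page, with a final `flush()` at the end of a crawling assignment.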

[0066] Preferably, a crawler used in the present invention is one that, upon downloading a page and determining that it is new or changed, also will extract the links on the downloaded page for reporting back to the LS. Also, a crawler for use in the present invention preferably is one that will return the links originally received from the LS, with updated digests and crawl times.

[0067] Modifications and variations on the crawlers mentioned above by way of example may be made, within the present invention.

[0068] In the invention, a plurality of a single type of crawler may be used, or different crawlers may be used in various combinations.

[0069] An example of a relationship between a link server 2, crawler 1 b and web mining application 5 in FIG. 1(a) may be appreciated with reference to FIG. 1(b). The crawler 1 b requests (1001) from the Link Server 2 an assignment of URLs to check. The link server 2 sends a batch of URLs (1002) to the crawler 1 b. The crawler 1 b checks the URLs as requested to determine which, if any, have changed. The crawler 1 b returns the updated state of the URLs and any content changes (1004) to the link server 2. The link server 2 updates the state of the returned URLs and broadcasts (1006) to the web mining application 5 URL updates for those URLs that have changed. The broadcast updates are used by the web mining application 5 to keep web mining applications running. The cycles of URL requests, batch sending and returning (1001, 1002, 1004) preferably repeat continuously. Request initiation (1001) is by crawler 1 b in FIG. 1(b), but may be by any crawler seeking work to perform.
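
The request, batch, return and broadcast cycle (1001, 1002, 1004, 1006) may be illustrated with an in-process sketch. Everything named below (`LinkServer`, `crawl`, the `fetch` callback, the use of a CRC-32 as the digest, the sample pages) is an assumption made for illustration; crawl times and network transport are omitted for brevity.

```python
import zlib
from typing import Callable, Dict, List, Optional, Tuple

Link = Tuple[str, Optional[int]]  # (url, digest from last crawl, if any)
Result = Tuple[str, int, bool]    # (url, new digest, changed?)


class LinkServer:
    def __init__(self, urls: List[str], batch_size: int = 2) -> None:
        self.digests: Dict[str, Optional[int]] = {u: None for u in urls}
        self.queue: List[str] = list(urls)
        self.batch_size = batch_size
        self.listeners: List[Callable[[str], None]] = []  # web mining applications

    def request_batch(self) -> List[Link]:
        """Steps 1001/1002: a crawler asks for work and receives a batch of URLs."""
        batch = self.queue[:self.batch_size]
        self.queue = self.queue[self.batch_size:]
        return [(u, self.digests[u]) for u in batch]

    def return_results(self, results: List[Result]) -> None:
        """Steps 1004/1006: record the updated state and broadcast changed URLs."""
        for url, digest, changed in results:
            self.digests[url] = digest
            self.queue.append(url)  # schedule for a later re-check
            if changed:
                for listener in self.listeners:
                    listener(url)


def crawl(batch: List[Link], fetch: Callable[[str], bytes]) -> List[Result]:
    """Check each URL; a digest mismatch (or no prior digest) counts as changed."""
    results = []
    for url, old_digest in batch:
        digest = zlib.crc32(fetch(url))
        results.append((url, digest, digest != old_digest))
    return results


# Three cycles against an in-memory stand-in for the web.
pages = {"u1": b"one", "u2": b"two"}
server = LinkServer(list(pages))
updates: List[str] = []
server.listeners.append(updates.append)

server.return_results(crawl(server.request_batch(), pages.__getitem__))  # both are new
server.return_results(crawl(server.request_batch(), pages.__getitem__))  # nothing changed
pages["u2"] = b"two, revised"
server.return_results(crawl(server.request_batch(), pages.__getitem__))  # u2 changed
```

Only the first and third cycles produce broadcasts; the second cycle returns updated state to the link server but notifies no listener.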

[0070] An example of a system, in operation, according to a preferred embodiment of the invention, having a LS and crawlers as set forth above, is as follows, discussed with reference to FIGS. 1(a), (b) and 2.

[0071] Link Server 2 sends a batch of links to check 3 to a crawler 1 b. According to the batch 3, crawler 1 b then executes its crawling assignment, during which the crawler 1 b encounters content. The crawler 1 b filters the encountered content and extracts only that which is new or changed, and then returns this content 4 to at least one data center and preferably any interested web mining applications 5.

[0072] Web mining application 5 may be a database or similar storage center that can process the data at web mining application 5. One or more additional web mining applications may be used. The content 4 provided by the crawlers 1 b etc. includes content, processed content and new content notification.

[0073] By using the crawlers 1 b etc. to filter the data and only return or notify regarding the fresh content, less bandwidth is needed to get the information to the web mining application 5 (from the viewpoint of the web mining application 5). Compressing the data from the crawlers 1 b to the web mining application 5 further reduces the bandwidth costs. In turn, reducing the bandwidth requirements of a web mining application increases the rate at which a web mining application can “scan” or find out about new information on the network. This increased scan rate allows each web mining application to learn about new information quickly.
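
Compression of changed content before sending may be sketched as below. The choice of zlib (DEFLATE) and the function names are assumptions; the specification does not name a compression scheme.

```python
import zlib


def pack_page(content: bytes) -> bytes:
    # Compress changed page content before it leaves the crawler (level 9 = smallest output).
    return zlib.compress(content, 9)


def unpack_page(payload: bytes) -> bytes:
    # Restore the original content at the receiving side.
    return zlib.decompress(payload)


page = b"<p>repetitive markup compresses well</p>" * 200
packed = pack_page(page)
```

For markup-heavy pages such as the sample above, the packed payload is a small fraction of the original size, which is the bandwidth saving the paragraph refers to.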

[0074] Again with reference to FIG. 1(a), in the invention the crawler 1 b returns to Link Server 2 old links with new digests and crawl dates 6 and new and dead links 7 that the crawler 1 b has found.

[0075] A preferred embodiment of the invention also may be appreciated with reference to FIG. 2, showing functions performed by a crawler 1 b from FIG. 1(a). Namely, after getting a link from the link server 100, the crawler gets a link header 101. If the crawler finds no last modification date on the header 103, the crawler downloads the page 105. For the downloaded page, the crawler checks whether the CRCs match 109 or do not match 106. If the CRCs do not match 106, the page is processed, the crawler performs an updating process 107 of updating links for the CRC and for crawl date, and the crawler 1 b returns the link and processed page to the Link Server 2 and gets the next link 108.

[0076] If, however, after getting the link header 101, the crawler finds a last modified date available on the header 102, the crawler gets the header date 110 and determines if the header date is older or newer than the last crawl date. If the header last modified date is older than the last crawl date 111, the crawler treats the page as having no change and gets the next link 112.

[0077] If the crawler finds that the header has a last modified date newer than the last crawl date 113, the crawler performs an update 107 and returns the link and processed page and gets a next link 108.
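
The decision flow of FIG. 2 may be sketched as a single function. This is a minimal sketch under stated assumptions: `get_last_modified` and `download` are hypothetical callbacks standing in for the header request and the page download, the digest is taken to be a CRC-32, and a header date equal to the last crawl date is treated as unchanged (the text addresses only "older" and "newer").

```python
import zlib
from typing import Callable, Optional, Tuple


def check_link(url: str,
               last_crawl: Optional[float],
               last_crc: Optional[int],
               get_last_modified: Callable[[str], Optional[float]],
               download: Callable[[str], bytes],
               ) -> Tuple[bool, Optional[int], Optional[bytes]]:
    """Return (changed, crc, content); content is None when the page is unchanged."""
    modified = get_last_modified(url)                    # 101: get link header
    if modified is not None and last_crawl is not None:  # 102/110: header date available
        if modified <= last_crawl:                       # 111: not newer than last crawl
            return False, last_crc, None                 # 112: no change, next link
        content = download(url)                          # 113: newer than last crawl
        return True, zlib.crc32(content), content        # 107/108: update and return
    content = download(url)                              # 103/105: no usable date
    crc = zlib.crc32(content)
    if last_crc is not None and crc == last_crc:         # 109: CRCs match
        return False, crc, None
    return True, crc, content                            # 106/107/108: new or changed
```

A never-crawled link arrives with neither a crawl date nor a digest, so it falls through to the download branch and is always reported as changed. The caller is expected to record the returned CRC and the current time as the link's updated state.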

[0078] Particularly, it will be appreciated that in the case where the link has never been crawled by a crawler in the crawling system, the link will lack a digest and last crawl date, and so the crawler that encounters such a link will download the linked page and report the information back to the LS 2.

[0079] In such a system as discussed above, it will be appreciated that the web mining applications 5 receive new information without competing old information. Thus, the web mining applications 5 can process the new information with fewer resources, and thus more quickly and efficiently, than if the new information had accompanying old information. The event-driven model thus provides superior results over, and is distinguished from, the polling model.

[0080] It will be appreciated that the above activities have been discussed for a single crawler while the system is in operation, but that meanwhile each of the crawlers in the system is proceeding likewise on its own with regard to other pages.

[0081] Moreover, there are a number of further activities that optionally may be performed by the crawlers. For example, the crawlers optionally can attempt to connect to the links that are extracted from a downloaded page. This activity will identify whether the links are valid. Another optional activity is that, once connected, the crawler can filter out the links and only extract and return links of a certain type, such as only links of the MIME type HTML/TEXT. Also, the crawlers optionally can perform information processing on downloaded pages to save on the computation time by the web-mining applications. Processing examples include: stripping out HTML tags; using information retrieval and/or natural language processing techniques to understand the subject matter of the document and/or categorize the document.
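
Two of the optional crawler-side activities, stripping out HTML tags and extracting links, may be sketched with the Python standard library. The names `PageProcessor` and `process_page` are hypothetical. As simplifying assumptions, text inside script or style elements is not excluded, and link validation and MIME-type filtering (which require connecting to each link) are not shown.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class PageProcessor(HTMLParser):
    """Collects visible text and absolute link targets from one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self.text_parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Record the target of each anchor, resolved against the page URL.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        # Keep non-empty text runs; the tags themselves are dropped.
        if data.strip():
            self.text_parts.append(data.strip())


def process_page(base_url: str, html: str) -> Tuple[str, List[str]]:
    parser = PageProcessor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


text, links = process_page(
    "http://www.abc.com/dog/",
    '<html><body><p>Hello <a href="cat/pear.html">pear</a></p></body></html>',
)
```

The extracted links would be reported back to the LS, and the tag-free text would be sent on in place of the raw page.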

[0082] It will be appreciated that a web-mining application is not limited in its use of the received data and may use the received data in many ways. When a web-mining application is notified of or receives new or changed content, some of the options for proceeding include: store this information; store only the delta changes of a page (like a software version control system); mine the data; process the data for use by a search engine; provide intelligent caching, and the like.

[0083] Where more than one web mining application is provided, the web mining applications may be combined and used together as is known to those working with computer systems.

[0084] With reference to FIG. 3(a), an example of using the invention in the context of a metacomputer system such as Share is shown. Each crawler has a respective contributor environment, such as contributor environment 8 a for crawler 1 a. Each crawler is in communication with one or more web servers, such as crawler 1 a in communication with web servers 9 and 9′, and crawler 1 c in communication with web server 9′″. Each contributor environment 8 a, 8 b, 8 c is controlled by the share allocation server 10. Each crawler 1 a, 1 b, 1 c is in communication with the HyperDog system of LinkServers and web mining applications 11, an example of a link server being link server 2 in FIG. 1, and an example of a web mining application being web mining application 5 in FIG. 1.

[0085] Referring to FIG. 3(b), an example of using the invention in the context of the Internet is shown. In this example, a plurality of contributor environments (CE) are shown, such as CE 8 a (also shown on FIG. 3(a)). The CEs such as CE 8 a are dispersed on the Internet 12. Each CE, such as CE 8 a, is in communication with the allocation server 10.

[0086] The present invention makes possible dissemination of update notifications to listening web-mining applications. Such dissemination of update notifications may be performed, by way of example, as shown in FIG. 4. FIG. 4 shows a stream of updates created by a crawling system according to the invention, and the web-mining applications 5 that are built by and stay current based on the invention. In the crawling system as shown in FIG. 4, which is a non-limiting example according to the invention, a data center 13 is provided. The data center 13 comprises a plurality of link servers 2 and a plurality of web mining applications (WMA). The WMAs do not necessarily have to reside within a data center, as shown in FIG. 7, and can be owned and operated by third parties that use standard communications to listen to the stream of updates. The data center 13 is in communication with a plurality of crawlers. A plurality of crawlers 1 a, 1 b, 1 c are in communication with a plurality of web servers 9. The crawlers 1 a, 1 b, 1 c return information to a plurality of link servers 2, which in turn provide their new information to a spreading system 14 which in turn selectively provides new information to a plurality of web mining applications 5. In the exemplary spreading system as shown in FIG. 4, the spreading system 14 receives all new information. The spreading system 14 controls the distribution of the new information so that each web mining application 5 receives desired new information but not old information and not new information that is unwanted by the particular web mining application 5.
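
The selective distribution performed by a spreading system may be illustrated as a publish/subscribe dispatcher in which each web mining application registers a predicate describing the updates it wants. The class name `SpreadingSystem`, the update layout, and the sample predicates are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Update = Dict[str, str]


class SpreadingSystem:
    """Hypothetical dispatcher: receives every update, forwards only wanted ones."""

    def __init__(self) -> None:
        self._subscribers: List[Tuple[Callable[[Update], bool],
                                      Callable[[Update], None]]] = []

    def subscribe(self, wants: Callable[[Update], bool],
                  deliver: Callable[[Update], None]) -> None:
        # Register a web mining application with its interest predicate.
        self._subscribers.append((wants, deliver))

    def publish(self, update: Update) -> None:
        # Forward the update to each application whose predicate accepts it.
        for wants, deliver in self._subscribers:
            if wants(update):
                deliver(update)


spreader = SpreadingSystem()
edu_updates: List[Update] = []
all_updates: List[Update] = []
spreader.subscribe(lambda u: ".edu/" in u["url"], edu_updates.append)
spreader.subscribe(lambda u: True, all_updates.append)

spreader.publish({"url": "http://www.xyz.edu/a/b/c/d/X.html"})
spreader.publish({"url": "http://www.abc.com/dog/a.html"})
```

Because only new or changed pages are ever published, no application receives old information, and the predicate keeps unwanted new information away from a given application.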

[0087] In another embodiment, the invention provides a monitoring system, an example of which is shown in FIG. 7. The monitoring system provides for a plurality of crawlers 1 b to communicate with a data center 13 through a network/metacomputer system 17. From the received information, the data center 13 can produce a changed link stream 18 and a changed content stream 19. The changed link stream 18 may be provided to selected web mining applications, such as WMA 5. The changed content stream 19 may be provided to selected WMAs, such as WMA 5′.

[0088] Advantageously, the invention makes possible an event driven system for web mining applications, so that the cumbersome conventional polling systems may be avoided. Advantages of an event driven model according to the invention, contrasted with a conventional polling model, may be seen with reference to FIGS. 5(a) and 5(b). FIG. 5(a) depicts a conventional polling model, in which the WWW is sampled by a plurality of applications, App 1, etc. In the polling model of FIG. 5(a), each of the applications separately downloads and processes all documents, leading to much duplication and repetition of effort.

[0089] In an event driven model according to the invention, an example of which is shown in FIG. 5(b), a HyperDog system performs polling and downloading, filters the content, and provides selected content (i.e., changes) to the applications. The applications thus are only faced with processing a small percentage of content on the WWW, namely, changed content.

[0090] Advantages of the invention when event driven web mining is used may be seen with reference to the contrasting flow charts of FIGS. 5(c) and 5(d). In the conventional polling methods, a web document 15 is delivered in its entirety to a web mining application 5, as shown in FIG. 5(c). Advantageously and by contrast, the invention makes possible the delivery of information in more useful form to the web mining application 5, as shown by way of example in FIG. 5(d). For example, in the invention, a web document 15 may be processed by a change filter 16 and sent to web mining application 5 in processed form rather than in its entirety. The invention makes possible the processing (such as by change filter 16) of some or all documents before they are provided to a web mining application. Change filter 16 may be used in the invention to process any document before it reaches the web mining application. Change filter 16 may be used to filter unchanged documents so that only new or changed web documents can reach the web mining application 5.

[0091] With reference to FIG. 6, the advantages of the present invention may be further appreciated, by considering the bandwidth resources expended as the number of web mining applications grows. On the x-axis, the number of web mining applications (WMAs) is plotted. On the y-axis, resource usage in bandwidth is plotted. For a conventional polling model (-[]-), an increase in the number of WMAs results in a directly proportional increase in the bandwidth. For a HyperDog model, which is an example according to the invention, there is no difference in the bandwidth regardless of the addition of WMAs, i.e., increasing the number of WMAs does not increase the bandwidth needed.

[0092] It will be appreciated as set forth above, particularly with regard to the figures but without limitation thereto, that the invention results in the location of much new information on the network. In a further aspect, the invention provides ways to process the new information, such as to assign weight or importance. New information that is found receives a measurement of weight partially based on the weight of “nearby documents”. The definition of “nearby document” is based on the URL (Uniform Resource Locator) structure of the information. To further enhance this weighting process, a time dependent function can be applied to this weight. Thus, as the content ages, its portion of weight gained from the documents around it will decrease with time.

[0093] Together, aspects of the invention are combined to provide a system that gathers more information than previous systems, continually provides access and notification of fresh information, and is able to rate the importance or relevance of this fresh information.

[0094] Another aspect of the present invention is ranking of new or changed content. Namely, once new content is found, it must be ranked on its importance or relevance. This ranking is mainly useful for search engines, but has other uses as well.

[0095] To accomplish this relevance rating, the page having new content is partially ranked on the authoritativeness of its neighboring pages. The measure of “neighboring” is based on the URL structure of the pages. For a new page X with a URL of http://www.xyz.edu/a/b/c/d/X.html, its weight will have a component that is based on the weight of pages in the same “/a/b/c/d” directory, where those neighboring, existing pages already have associated with them an Authority_Measure conventionally derived.

[0096] The neighboring page-based ranking component, newly introduced by the invention and for applying to a page on a network such as the Internet with new content, is called the Temporary_Authority_Measure (TAM). For example, if there were two pages in the “/a/b/c/d/” directory, X.html and Q.html, and Q had an Authority_Measure of 100, then X would have a TAM of 80.

[0097] If Q was not in the “/a/b/c/d/” directory, but was in the “/a/b/c/” or “/a/b/c/d/e/” directory, then X would have a TAM of 20.

[0098] As the distance in directory structure increases, the contributing weight decreases.

[0099] The general formula or table of formulae for such a TAM boost is customizable to the search engine. For example, different search engines may give different TAMs, or they could have the same TAM measurement but weight the TAM into their overall Authority_Measure differently.
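
One possible table of the kind mentioned above is sketched below, using the two data points given in the text (80 for a neighbor in the same directory, 20 for a neighbor one directory away, both for an Authority_Measure of 100). The distance convention (1 for the same directory, 2 for one directory level away), the zero weight beyond distance 2, and the rule of taking the best single neighbor are all assumptions; the specification leaves the formula to the search engine.

```python
from typing import Iterable, Tuple

# Percentage of a neighbor's Authority_Measure contributed at each directory distance.
# Entries for 1 and 2 follow the examples in the text; other distances are assumed to be 0.
DISTANCE_WEIGHT_PERCENT = {1: 80, 2: 20}


def temporary_authority(neighbors: Iterable[Tuple[int, float]]) -> float:
    """neighbors: (distance, Authority_Measure) pairs for existing nearby pages.

    Assumed combination rule: the strongest single contribution wins.
    """
    return max(
        (authority * DISTANCE_WEIGHT_PERCENT.get(distance, 0) / 100
         for distance, authority in neighbors),
        default=0.0,
    )
```

A search engine wishing to weight the TAM differently would replace the table or the combination rule while leaving the rest of its ranking unchanged.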

[0100] Thus, an initial TAM for a new page Q can be assigned based on a nearby page X already having an Authority_Measure as set forth above.

[0101] With reference to FIG. 8(a), an example of TAM as a function of distance for a TAM according to the invention may be seen. In FIG. 8(a), TAM is plotted on the y-axis versus distance on the x-axis, with time held constant. The curve in FIG. 8(a) is an example, and other curves may be used. The new document which was used to prepare FIG. 8(a) may be http://www.abc.com/dog/cat/mouse/squirrel/pet.html. Examples of distance measures from the new document are as follows:

[0102] D=1: http://www.abc.com/dog/cat/mouse/squirrel/apple.html

[0103] D=2: http://www.abc.com/dog/cat/mouse/orange.html

[0104] D=3: http://www.abc.com/dog/cat/pear.html

[0105] D=3: http://www.abc.com/dog/cat/mouse/wombat/a.html

[0106] D=4: http://www.abc.com/dog/a.html

[0107] D=4: http://www.abc.com/dog/cat/class/us.html

[0108] D=5: http://www.abc.com

[0109] D=6: http://www.abc.com/hello
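
The distance values listed above are consistent with one simple rule, offered here only as an inferred sketch: D equals one plus the number of directory steps (up and then down) between the directory of the new document and the directory of the other document. The function names are hypothetical, both URLs are assumed to be on the same host, and a final path segment is assumed to be a file only if it contains a dot (so that "hello" is read as a directory).

```python
from typing import List
from urllib.parse import urlparse


def directory_of(url: str) -> List[str]:
    # Split the URL path into directory segments, dropping a trailing file name.
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and "." in segments[-1]:  # assumption: a dot marks a file
        segments = segments[:-1]
    return segments


def url_distance(new_url: str, other_url: str) -> int:
    # 1 for the same directory, plus one per step up or down the directory tree.
    a, b = directory_of(new_url), directory_of(other_url)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 1 + (len(a) - common) + (len(b) - common)


NEW_DOC = "http://www.abc.com/dog/cat/mouse/squirrel/pet.html"
```

Under this rule a page one directory above or one directory below the new document is at D=2, which matches the directory examples given for a TAM of 20.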

[0110] Additionally, the invention provides for modifying the TAM measurement to reflect time. For example, if Q is new content assigned an initial TAM of 80 based on links around it, then as time passes, the TAM for Q gained from Q's neighbors will decrease. If Q.html is assigned a TAM of 80 when it is first found, then after a week the TAM will have dropped to 60, after another week the TAM will be down to 40, and after a month or more the TAM on Q is 0. This time-based adjustment ensures that the TAM of the new content does not remain the dominating Authority_Measure forever, and a page eventually gains its own Authority_Measure based on popularity.

[0111] With reference to FIG. 8(b), an example of time-decay of a TAM according to the invention may be seen. In FIG. 8(b), TAM is plotted on the y-axis versus time on the x-axis, with distance held constant. The curve in FIG. 8(b) is an example, and other curves may be used.

[0112] The general formula or table of formulae for the TAM time-decay feature is customizable to a particular search engine.
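
A linear decay is one formula that reproduces the example values given above (80 at discovery, 60 after one week, 40 after two weeks, 0 after about a month). The sketch below assumes a 28-day lifetime and a straight-line curve; both are illustrative choices, and other curves may be used.

```python
def decayed_tam(initial_tam: float, age_days: float, lifetime_days: float = 28.0) -> float:
    # Fraction of the neighbor-derived weight still remaining; never below zero.
    remaining = max(0.0, 1.0 - age_days / lifetime_days)
    return initial_tam * remaining
```

For an initial TAM of 80, this gives 60 at 7 days, 40 at 14 days, and 0 from 28 days onward.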

[0113] Those working with computer systems will appreciate, with reference to the above, that embodiments of the present invention may be constructed with the above information using, as necessary, conventionally available hardware and software, and programming techniques.

[0114] Although particular mention has been made above of search engine applications, it will be appreciated that the systems of the present invention are not limited to search engines, but may be used in any system that requires crawling or event notification on the state of information on a massive information network (of which the Internet is an important example). The invention provides systems that are commercially useable in many applications, including but not limited to: enhancing the performance of search engines; gathering information for data mining applications; and gathering information for “Electronic Libraries”.

[0115] While the invention has been described in terms of its preferred embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.

We claim:
 1. A system for processing fresh information added to a network, comprising: for a network, identifying fresh information added to the network; and presenting the fresh information as a stream of events.
 2. The system of claim 1, wherein the stream of events is made available for concurrent use by a plurality of web-mining applications.
 3. The system of claim 2, including rating the fresh information.
 4. The system of claim 1, wherein the fresh information identification is by a metacomputer deployed to identify fresh information.
 5. The system of claim 1, wherein the network is the Internet or intranet.
 6. A method of gathering information freshly available on a network, comprising: deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
 7. The method of gathering information of claim 6, including deploying a distributed system of crawlers.
 8. The method of claim 7, including commanding the crawlers to encounter content on the network and to filter encountered content for freshness.
 9. The method of claim 8 wherein the filter of encountered content for freshness comprises instructions to filter old or unchanged information and to gather only information on the network that is new or changed.
 10. The method of claim 6, wherein the network is the Internet or intranet.
 11. The method of claim 7, wherein the crawlers sit on a plurality of machines across the network.
 12. The method of claim 7, wherein the metacomputer includes at least one link server for receiving content from the crawlers.
 13. The method of claim 12, wherein the crawlers are commanded to return only the fresh encountered content to the link server.
 14. The method of claim 6, wherein data is compressed before being sent by a crawler.
 15. The method of claim 6, including rating gathered fresh information.
 16. A method of processing new information on a network, comprising: (A) for information encountered on the network that is new relative to a data base of existing content, identifying at least one existing document within a predetermined distance from the newly encountered information; and, (B) identifying an already-established weight of the at least one existing nearby document identified according to step (A).
 17. The method of claim 16, including for the newly encountered information, assigning a weight measurement partially based on the already-established weight(s) identified in step (B) of the at least one existing nearby document.
 18. The method of claim 16, including time-adjusted weighting of the new information.
 19. The method of claim 17, including time-adjusted weighting of the new information comprising assigning a time dependent function to the assigned weight measurement, wherein as the new information ages, less weight based on the at least one existing nearby document is accorded the new information.
 20. A high scan rate, decreased bandwidth method for data delivery, comprising: (A) providing at least one coordinating Link Server to direct a plurality of crawlers through low bandwidth commands; (B) providing that when a crawler is instructed by the Link Server to check a page link, for the to-be-checked page link the crawler also is told information including URL name, last time checked, and a last crawl date page digest from when the link was last checked; (C) connecting a crawler to the to-be-checked page and commanding the crawler to read a header of the to-be-checked page, and (1) commanding the crawler that if the to-be-checked page header returns a last modified date, the crawler check the page against the last crawl date associated with the to-be-checked page; further provided that: (i) for a to-be-checked page found to be unchanged, the crawler bypasses and does not download/process the to-be-checked page; but (ii) if the to-be-checked page is found to have changed since the last checked time, the crawler notifies the Data Center that the to-be-checked page has been changed, downloads, processes, compresses and sends the to-be-checked page content to the Data Center; (2) commanding the crawler that if no last modification date is found in the to-be-checked page header, the crawler downloads the page, and then runs the downloaded page through a function at the crawler to obtain a new page digest for matching against a last crawl page digest, if any, provided that: (i) if and only if the new page digest can be matched to a last crawl page digest, the crawler proceeds to the next link to be checked; but (ii) if for the new page digest no matching last crawl page digest is found, the crawler then notifies the Data Center and/or transmits the new page digest to the Data Center, further provided that the crawler returns the links originally received from the Link Server with updated digests and crawl times.
 21. The method of claim 20, wherein whenever the crawler downloads a page determined to be new or changed, the crawler optionally extracts the links on the downloaded page and reports the extracted links to the Link Server.
 22. The method of claim 21, including identifying if extracted links are valid by commanding the crawlers to attempt to connect to the extracted links from a downloaded page.
 23. The method of claim 21, including commanding the crawler, once connected, to also filter out the links and only extract and return HTML/TEXT links.
 24. The method of claim 21, including information processing by the crawlers on the downloaded pages.
 25. The method of claim 24, wherein the information processing is selected from the group consisting of: stripping out HTML tags and using information retrieval and/or natural language processing techniques to characterize the document.
 26. The method of claim 20, including updating Link Server records on the links and scheduling them for later crawling or re-crawling.
 27. The method of claim 26, including management by the Link Server of link assignments for crawling.
 28. The method of claim 27, wherein the management by the Link Server comprises assigning network-wise close links to a crawler and/or arranging for relatively more frequent crawling of links from domains with track records of frequent change.
 29. The method of claim 20, wherein the Data Center upon receiving new or changed content conducts at least one of the following: (a) storage of the new or changed content; (b) storage of only delta changes of a page; (c) data mining; (d) data processing; (e) application of data to at least one search engine; (f) intelligent caching.
 30. A ranking method for new or changed content on a network, comprising partially ranking the new or changed content based on at least one neighboring page.
 31. The method of claim 30, wherein the partial ranking of a new page X with a URL of form http://www.xyz.edu/a/b/c/d/X.html, wherein “xyz” may be any domain name, “.edu” may be any web suffix including but not limited to .com, .net and .tv, and a, b, c and d are variables, comprises assigning a Temporary_Authority_Measure based on at least one Authority_Measure of at least one page in the same /a/b/c/d/ directory or in a page that is a predetermined distance from the new page.
 32. The method of claim 30, wherein the ranking method includes reducing the effect of any neighboring page with time.
 33. The method of claim 31, wherein the ranking method includes a time-dependent reduction of the Temporary_Authority_Measure.
 34. Computer-readable information produced (A) from a stream of events comprising fresh information identified for a network; or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
 35. An index prepared from a computer data base of computer-readable information produced (A) from a stream of events comprising fresh information identified for a network; or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
 36. An electronic library wherein the library consists essentially of an index prepared from a computer data base of computer-readable information produced (A) from a stream of events comprising fresh information identified for a network; or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
 37. A computerized search engine wherein the search engine queries an index prepared from computer-readable information produced (A) from a stream of events comprising fresh information identified for a network, or (B) by deploying a metacomputer to gather information freshly available on the network, wherein the metacomputer comprises information-gathering crawlers instructed to filter old or unchanged information.
 38. A distributed system of crawlers returning content from a network to a link server, wherein each crawler: (1) minimizes time spent on old and unchanged content; (2) filters and excludes from returning old or unchanged content to the link server; and (3) gathers and returns fresh content to the link server.
 39. A monitoring method for at least one web mining application, comprising screening web documents for changed content, wherein the screening occurs in a system external to the web mining application.
 40. The monitoring method of claim 39, including, in the external system, locating changed content and preparing a stream of updates characterizing the changed content.
 41. The monitoring method of claim 40, including providing the stream of updates to the at least one web mining application.
 42. The monitoring method of claim 41, wherein the stream of updates is provided to multiple web mining applications.
 43. The monitoring method of claim 42, wherein the stream of updates is simultaneously useable by the multiple web mining applications.
 44. The monitoring method of claim 39, wherein the screening includes applying a change filter to prohibit unchanged web documents and other repetitive content from reaching the web mining application.
 45. The monitoring method of claim 44, wherein the change filter comprises a data center cooperating with a network/metacomputer system.